Pumping Documents Through a Domain and Genre Classification Pipeline

Authors

  • Udo Hahn
  • Joachim Wermter
Abstract

We propose a simple yet effective pipeline architecture for document classification. The task is to classify large, heterogeneous document streams in a layered nine-category system that distinguishes medical from non-medical texts and sorts medical texts into various subgenres. While the document classification problem is often addressed with computationally powerful and hence costly classifiers (e.g., Bayesian ones), we have gathered empirical evidence that a much simpler approach based on n-gram statistics achieves a comparable level of classification performance.
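The abstract does not say which n-gram statistics are used or how the pipeline stages are wired together, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes character trigram frequency profiles compared by cosine similarity, and a two-stage layout in which a domain classifier runs first and a genre classifier runs only on texts labelled medical. The names `ngram_profile`, `cosine`, `NgramClassifier`, and `classify`, and the label strings, are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=3):
    """Count overlapping character n-grams in lowercased, whitespace-normalised text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class NgramClassifier:
    """Nearest-profile classifier: one merged n-gram profile per label (assumed design)."""

    def __init__(self, n=3):
        self.n = n
        self.profiles = {}

    def fit(self, labelled_texts):
        # labelled_texts is an iterable of (label, text) pairs.
        for label, text in labelled_texts:
            self.profiles.setdefault(label, Counter()).update(ngram_profile(text, self.n))
        return self

    def predict(self, text):
        # Return the label whose profile is most similar to the text's profile.
        profile = ngram_profile(text, self.n)
        return max(self.profiles, key=lambda label: cosine(profile, self.profiles[label]))


def classify(text, domain_clf, genre_clf):
    """Layered pipeline: decide the domain first, then the genre for medical texts only."""
    if domain_clf.predict(text) != "medical":
        return ("non-medical", None)
    return ("medical", genre_clf.predict(text))
```

In this sketch, `domain_clf` would be fitted on texts labelled `"medical"` or `"non-medical"`, and `genre_clf` on medical texts labelled with their subgenre. Training and prediction only require counting substrings and computing one similarity per label, which is the kind of low cost the abstract contrasts with heavier classifiers.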

Similar resources

Learning to classify documents according to genre

Current document retrieval tools succeed in locating large numbers of documents relevant to a given query. While search results may be relevant according to the topic of the documents, it is more difficult to identify which of the relevant documents are most suitable for a particular user. Automatic genre analysis, that is, the ability to distinguish documents according to style, would be a usefu...

Genre Classification of Web Documents

Retrieving relevant documents over the Web is an overwhelming task when search engines return thousands of Web documents. Sifting through these documents is time-consuming and sometimes leads to an unsuccessful search. One problem is that most search engines rely on matching a query to documents based solely on topical keywords. However, many users of search engines have a particular genre in m...

Thesis Stereotyping the Web: Genre Classification of Web Documents

Retrieving relevant documents over the Web is a difficult task. Currently, search engines rely on keywords for matching documents to user queries. This paper explores the potential for discriminating documents based on the genre of the document. I define genre as a taxonomy that incorporates the style, form and content of a d...

Towards Automatic Web Genre Identification: A Corpus-Based Approach in the Domain of Academia by Example of the Academic's Personal Homepage

We argue for a systematic analysis of one particular, well structured domain—academic Web pages—with regard to a special class of digital genres: Web genres. For this purpose, we have developed a database-driven system that will ultimately consist of more than 3 000 000 HTML documents, written in German, which are the empirical basis for our research. We introduce the notions of Web genre type ...

Detailed Scheduling of Tree-like Pipeline Networks with Multiple Refineries

In the oil supply chain, the refined petroleum products are transported by various transportation modes, such as rail, road, vessel and pipeline. The latter provides one of the safest and cheapest ways to connect production areas to local markets. This paper addresses the operational scheduling of a multi-product tree-like pipeline connecting several refineries to multiple distribution centers ...

Publication date: 2004